1. Abstract

We use a non-Markovian coupling and small modifications of techniques from the 
theory of finite Markov chains to analyze some Markov chains on continuous state 
spaces. The first is a generalization of a sampler introduced by Randall and Winkler, 
the second a Gibbs sampler on narrow contingency tables. 



2. Introduction 

The problem of sampling from a given distribution on high-dimensional continuous
spaces arises in the computational sciences and Bayesian statistics, and a frequently
used solution is Markov chain Monte Carlo (MCMC); see [16] for many examples.
Because MCMC methods produce good samples only after a lengthy mixing period,
a long-standing mathematical question is to analyze the mixing times of the MCMC
algorithms that are in common use. Although there are many notions of mixing,
the most commonly used is the mixing time, which is based on the total variation
distance:

For measures $\mu$, $\nu$ with common measurable $\sigma$-algebra $\mathcal{A}$, the total variation
distance between $\mu$ and $\nu$ is

$$\|\mu - \nu\|_{TV} = \sup_{A \in \mathcal{A}} (\mu(A) - \nu(A))$$

For an ergodic discrete-time Markov chain $X_t$ with unique stationary distribution
$\pi$, the mixing time is

$$\tau(\epsilon) = \inf\{t : \|\mathcal{L}(X_t) - \pi\|_{TV} < \epsilon\}$$
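As a small illustration of these two definitions, the following sketch computes the total variation distance and a mixing time for a chain on a finite state space, where the supremum over sets is attained at $A = \{x : \mu(x) > \nu(x)\}$. The function names, the explicit starting state, and the lazy walk on a cycle are assumptions made for the example; they are not taken from the paper.

```python
def tv_distance(mu, nu):
    """Total variation distance between two probability vectors on a finite set.

    The supremum over events is attained on {x : mu(x) > nu(x)}, so it is
    enough to sum the positive parts of mu - nu.
    """
    return sum(max(m - v, 0.0) for m, v in zip(mu, nu))


def evolve(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def mixing_time(P, pi, start, eps, t_max=10_000):
    """Smallest t with ||L(X_t) - pi||_TV < eps for the chain started at `start`."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    for t in range(t_max + 1):
        if tv_distance(dist, pi) < eps:
            return t
        dist = evolve(dist, P)
    return None  # did not mix within t_max steps


def lazy_cycle_kernel(n):
    """Hypothetical example chain: lazy simple random walk on the n-cycle."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] += 0.5
        P[i][(i + 1) % n] += 0.25
        P[i][(i - 1) % n] += 0.25
    return P
```

For instance, `mixing_time(lazy_cycle_kernel(4), [0.25] * 4, 0, 0.1)` evaluates the definition for the walk on the 4-cycle started at state 0.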



Although most scientific and statistical uses of MCMC methods occur in continuous
state spaces, much of the mathematical mixing analysis has been in the discrete
setting. The methods that have been developed for discrete chains often break down
when used to analyze continuous chains, though there are some efforts, such as [28],
[24] and [18], to create general techniques. This paper extends the author's previous
work [26] and work of Randall and Winkler [22], and attempts to provide some more
examples of relatively sharp analyses of continuous chains similar to those used to
develop the discrete theory.

The first process that we analyze is a Gibbs sampler on the simplex with a very
restricted set of allowed moves. Fix a finite group $G$ of size $n$ with symmetric
generating set $R$ of size $m$. For unity of notation, label the group elements with the
integers from 1 to $n$. We consider the process $X_t[g]$ on the simplex
$\Delta_G = \{X \in \mathbb{R}^n : \sum_{g \in G} X[g] = 1,\ X[g] \ge 0\}$. At each step, choose $g \in G$, $r \in R$
and $\lambda \in [0,1]$ uniformly, and set

(1)  $X_{t+1}[g] = \lambda (X_t[g] + X_t[gr]), \quad X_{t+1}[gr] = (1-\lambda)(X_t[g] + X_t[gr])$

For all other $h \in G$, set $X_{t+1}[h] = X_t[h]$. Let $U_G$ be the uniform distribution on
$\Delta_G$; this is also the stationary distribution of $X_t$. Also consider a random walk $Z_t$
on $G$, where in each stage we choose $g \in G$ and $r \in R$ uniformly at random and
set $Z_{t+1} = gr$ if $Z_t = g$, set $Z_{t+1} = g$ if $Z_t = gr$, and $Z_{t+1} = Z_t$ otherwise. This is
the standard simple random walk on the Cayley graph, slowed down by a factor of
about $n$. Let $\gamma$ be the spectral gap of the walk $Z_t$, and follow the notation that $\mathcal{L}(X)$
denotes the distribution of a random variable $X$.
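The update rule (1) can be sketched in a few lines for the cyclic group $\mathbb{Z}_n$, written additively so that $gr$ becomes $(g + r) \bmod n$. The choice of group, the function name `gibbs_step`, and the explicit random generator argument are assumptions made for this illustration only.

```python
import random


def gibbs_step(x, generators, rng):
    """One step of the simplex Gibbs sampler on the Cayley graph of Z_n.

    x          -- current point of the simplex, a list of n nonnegative floats
    generators -- symmetric generating set of Z_n (must not contain 0 mod n)
    rng        -- a random.Random instance
    """
    n = len(x)
    g = rng.randrange(n)          # uniform group element
    r = rng.choice(generators)    # uniform generator
    lam = rng.random()            # uniform in [0, 1)
    h = (g + r) % n               # the neighbour "gr" in additive notation
    total = x[g] + x[h]
    y = list(x)
    y[g] = lam * total            # X_{t+1}[g]  = lambda (X_t[g] + X_t[gr])
    y[h] = (1.0 - lam) * total    # X_{t+1}[gr] = (1 - lambda)(X_t[g] + X_t[gr])
    return y
```

Each step only redistributes the mass on the two chosen coordinates, so the total mass and nonnegativity are preserved.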

Theorem 1 (Convergence Rate for Gibbs Sampler with Geometry). ForT > 56fc+i78 

$$\|\mathcal{L}(X_T) - U_G\|_{TV} < 14 n^{-k}$$

and conversely for T < k, 



7' 



||£(XT)-f/G||TV>^e-'-3n-5 



ni 



This substantially generalizes [22] and [26], from samplers corresponding to $G = \mathbb{Z}_n$
and $R = \{1, -1\}$ or $R = \mathbb{Z}_n \setminus \{0\}$ respectively, to general Cayley graphs. In addition
to being of mathematical interest, this process is an example of a gossip process
with some geometry, studied by electrical engineers and sociologists interested in how
information propagates through networks; see [25] for a survey.

The proof of the upper bound will use an auxiliary chain similar to that found
in [22], a coupling argument improved from [26], and an unusual use of comparison
theory from [9]. The proof of the lower bound is elementary.

The next example consists of narrow contingency tables. Beginning with the work
of Diaconis and Efron [6] on independence tests, there has been interest in finding
efficient ways to sample uniformly from the collection of integer-valued matrices with
given row and column sums. A great deal of this effort has been based on Markov
chain Monte Carlo methods. While some of the efforts have dealt directly with Markov
chains on these integer-valued matrices, much recent success, including [11] [21], has
involved using knowledge of Gibbs samplers on convex sets in $\mathbb{R}^n$ and clever ways to
project from the continuous chain to the desired matrices [20].

Unfortunately, while the general bounds are polynomial in the number of entries
in the desired matrix, they often have a large degree and leading coefficient; see
[17]. In this paper, we find some better bounds for very specific cases. Like the
paper [26], this is part of an attempt to make further use of non-Markovian coupling
techniques [13] [2] [5] [19] and also to expand the small set of carefully analyzed Gibbs
samplers [22] [23] [7] [8]. In this case, the new techniques are two slight modifications
of the path-coupling method introduced in [i]. In many path-coupling arguments, a
burn-in argument is used to show that for most pairs of points in a metric space, there
is a path along which the Markov transition kernel is contractive acting on any pair
of points along the path. In this argument, we show that for all paths, the kernel is
contractive acting on most pairs of points along the path. This type of modification
seems likely to be useful only on continuous spaces.

We consider the following Gibbs sampler $X_t[i,j]$ on the space
$M_n = \{X \in \mathbb{R}^{2n} : \sum_{i=1}^n X[i,j] = n \ \forall\, 1 \le j \le 2,\ \sum_{j=1}^2 X[i,j] = 2 \ \forall\, 1 \le i \le n,\ X[i,j] \ge 0\}$
of nonnegative $n$ by 2 matrices with column sums fixed to be $n$ and row sums fixed to be 2.
To make a step of the Gibbs sampler, choose two distinct integers $1 \le i < j \le n$ and
update the four entries $X_{t+1}[i,1]$, $X_{t+1}[i,2]$, $X_{t+1}[j,1]$ and $X_{t+1}[j,2]$ to be uniform
conditional on all other entries of $X_t$. Let $U_n$ be the uniform distribution on $M_n$
inherited from Lebesgue measure. Then we find the following reasonable bound on
the mixing time of this sampler:

Theorem 2 (Convergence Rate for Narrow Matrices). For $T > (31k + 81)\, n \log(n)$,

$$\|\mathcal{L}(X_T) - U_n\|_{TV} < 15 n^{-k}$$
and conversely for $T < (1 - k)\, n \log(n)$, and $n$ sufficiently large,

\\C{XT)-Un\\TV>l-2n-^ 



3. General Strategy and the Partition Process 

Both of our bounds will be obtained using a similar strategy, ultimately built
on the classical coupling lemma. We recall that a coupling of Markov chains with
transition kernel $K$ is a process $(X_t, Y_t)$ so that marginally both $X_t$ and $Y_t$ are Markov
chains with transition kernel $K$. Although we always couple entire paths $\{X_t\}_{t=0}^T$ and
$\{Y_t\}_{t=0}^T$, we often use the shorthand notation of saying that we are coupling $X_t$ and $Y_t$.
In order to describe a coupling, note that for both walks being studied, the evolution
of the Markov chain $X_t$ can be represented by $X_{t+1} = f(X_t, i(t), j(t), \lambda(t))$, where $f$ is
a deterministic function, $i(t), j(t)$ are random coordinates (either elements of $[n]$ or of
a group $G$), and $\lambda(t)$ is drawn from Lebesgue measure on $[0,1]$. These representations
are given in equations (1) and (8) respectively. To couple $X_t$ and $Y_t$, it is thus enough
to couple the update variables $i(t)^a, j(t)^a, \lambda(t)^a$, with $a \in \{x, y\}$, used to construct
$X_t$ and $Y_t$ respectively.

Our couplings will provide bounds on mixing times through the following lemma
(see [15], Theorem 5.2; they work in discrete space, but their proof does not rely on
this assumption):

Lemma 3 (Fundamental Coupling Lemma). If $(X_t, Y_t)$ is a coupling of Markov
chains, $Y_0$ is distributed according to the stationary distribution of $K$, and $\tau$ is the
first time at which $X_t = Y_t$, then

$$\|\mathcal{L}(X_t) - \mathcal{L}(Y_t)\|_{TV} \le P[\tau > t]$$

In each chain, then, we begin with $X_t$ started at a distribution of our choice, and $Y_t$
started at stationarity. For any fixed (large) $T$, we will then couple $X_t$ and $Y_t$ so that
they will have coupled by time $T$ with high probability. Each coupling will have two
phases: an initial phase from time 0 to time $T_1$ in which $X_t$ and $Y_t$ get close with high
probability, and a non-Markovian coupling phase from time $T_1$ to time $T = T_1 + T_2$
in which they are forced to collide. Unlike many coupling proofs, the time of interest
$T$ must be specified before constructing the coupling.

While the initial contraction phases are quite different for the two chains, the
final coupling phase can be described in a unified way. The unifying device is the
partition process $P_t$ on set partitions of $[n] = \{1, 2, \ldots, n\}$, introduced in [26] for
a special case of the first sampler treated here (see that paper for details). This
partition process contains some information about the coordinates $\{i(t), j(t)\}_{t=T_1}^{T}$
used by $Y_t$ throughout the entire process, and is the only source of information from
the future that is used to construct the non-Markovian coupling. Critically, we do not
use any information about the random variables $\lambda(t)$ used at each step, which makes it
trivial to check that the couplings constructed in this paper have the correct marginal
distributions.

The process $\{P_t\}_{t=T_1}^{T}$ will consist of a set of nested partitions of $[n]$,
$P_{T_1} \le P_{T_1+1} \le \ldots \le P_T = \{\{1\}, \{2\}, \ldots, \{n\}\}$, where we say partition $A$ is less than
partition $B$ if every element of partition $B$ is a subset of an element of partition $A$. To
construct $P_t$, we first look at the sequence of graphs $G_t$ with vertex set $[n]$ and edge set
$\{(i(s), j(s)) : s > t\}$. Then let $P_t$ consist of the connected components of $G_t$. While
constructing $P_t$, we will also record a series of 'marked times' $T = t_1 > t_2 > \cdots > t_m$ and
associated special subsets $S(t_j, 1)$ and $S(t_j, 2)$ of $[n]$. We will set $t_1 = T$, and then
inductively set $t_j = \sup\{t : t < t_{j-1},\ P_{t-1} \ne P_t\}$. Finally, note that if $P_{t-1} \ne P_t$, the only
difference between them is that two elements of $P_t$ have been merged into a single element
in $P_{t-1}$. Label the set merged at time $t_j$ with fewer elements $S(t_j, 1)$, and label the other
one $S(t_j, 2)$. If both sets have the same number of elements, set $S(t_j, 1)$ to be the one
containing the smallest element (this is, of course, quite arbitrary).
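The backward construction of the partition process can be sketched as follows. The indexing here is an assumption chosen for the code and may differ by one from the convention in the text: steps are numbered $0, \ldots, T-1$, `partitions[t]` holds the connected components of the graph with the edges of steps $s \ge t$, and a step is recorded as marked when its edge merges two blocks. The function name and the 0-based vertex labels are also illustrative.

```python
def partition_process(edges, n):
    """Build the nested partitions and marked steps from a list of update pairs.

    edges -- edges[t] = (i, j) is the pair of coordinates updated at step t
    n     -- number of coordinates, labelled 0, ..., n-1

    Returns (partitions, marked): partitions[t] is a set of frozensets, and
    marked is a list of (t, S1, S2) in decreasing order of t, where the edge
    of step t merges the blocks S1 and S2 and S1 is the smaller block
    (ties broken by which block contains the smallest element).
    """
    T = len(edges)
    block_of = {v: frozenset([v]) for v in range(n)}
    partitions = [None] * (T + 1)
    partitions[T] = set(block_of.values())  # all singletons at the final time
    marked = []
    for t in range(T - 1, -1, -1):          # scan the future backwards
        i, j = edges[t]
        a, b = block_of[i], block_of[j]
        if a != b:
            s1, s2 = sorted((a, b), key=lambda s: (len(s), min(s)))
            marked.append((t, s1, s2))
            merged = a | b
            for v in merged:
                block_of[v] = merged
        partitions[t] = set(block_of.values())
    return partitions, marked
```

Only the pairs $(i(t), j(t))$ enter this construction; the uniform variables $\lambda(t)$ are never inspected, which is the property the text relies on.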

We will be interested in the smallest time $\tau$ such that $P_{T-\tau} = \{[n]\}$, a single block.
From classical arguments (see e.g. chapter 7 of fsj). 
Lemma 4 (Connectedness). For the Gibbs sampler on narrow matrices,

P[r > + e)n \og{n)] < 2^"' 

The analogous lemma for the other example will be proved in Section 5, Lemma 7.

For both of our walks, we will use two types of coupling, the 'proportional' coupling
and the 'subset' coupling. In both cases, we will set $i(t)^x = i(t)^y$ and $j(t)^x = j(t)^y$ at
each step. In the proportional coupling, we will also set $\lambda(t)^x = \lambda(t)^y$.

To discuss the subset coupling, we must define the weight of $X_t$ on a subset $S \subset [n]$,
which we call $w(X_t, S)$. For the simplex walk, we define $w(X_t, S) = \sum_{s \in S} X_t[s]$. For
narrow matrices, we define $w(X_t, S) = \sum_{s \in S} X_t[s, 1]$. The subset couplings associated
with subset $S \subset [n]$, which are defined immediately prior to the corresponding subset
coupling lemmas (Lemma 8 for the simplex walk), will often set $w(X_{t+1}, S) = w(Y_{t+1}, S)$.
We say that a subset coupling of subset $S$ at time $t$ succeeds if that equality holds;
otherwise, we say it fails.

In each case, the coupling of $X_t$ and $Y_t$ during the non-Markovian coupling phase
will be as follows. At marked times $t_j$, we will perform a subset coupling of $X_{t_j}$, $Y_{t_j}$
with respect to $S(t_j, 1)$. At all other times, we will perform a proportional coupling.
This leads to:

Lemma 5 (Final Coupling). Assume the non-Markovian coupling phase lasts from
time $T_1$ to $T$, that $P_{T_1} = \{[n]\}$, and that all subset couplings succeed. Then $X_T = Y_T$.

Proof. Let $\mathcal{F}_t$ denote the collection of equations $w(X_t, S) = w(Y_t, S)$ for all $S \in P_t$.
We will show by induction that the equations $\mathcal{F}_t$ hold for all $T_1 \le t \le T$. At time $T_1$,
we have $w(X_{T_1}, [n]) = w(Y_{T_1}, [n]) = 1$. By definition of the partition process, if $t$ is not
a marked time and all equations $\mathcal{F}_t$ hold, then all equations $\mathcal{F}_{t+1}$ also hold. In fact,
this is true for any coupling of $\lambda(t)^x, \lambda(t)^y$ at that step, not just the proportional
coupling.

Assume $t = t_j$ is a marked time, and that the equations $\mathcal{F}_{t_j}$ hold. Then if $\mathcal{F}_{t_j+1}$
don't all hold, we must have that either $w(X_{t_j+1}, S(t_j,1)) \ne w(Y_{t_j+1}, S(t_j,1))$ or
$w(X_{t_j+1}, S(t_j,2)) \ne w(Y_{t_j+1}, S(t_j,2))$, since none of the terms in the other equations
change. By assumption, all subset couplings have succeeded, so $w(X_{t_j+1}, S(t_j,1)) =
w(Y_{t_j+1}, S(t_j,1))$. By construction, $w(X_{t_j+1}, S(t_j,2)) = w(X_{t_j+1}, S(t_j,1) \cup S(t_j,2)) -
w(X_{t_j+1}, S(t_j,1))$ and similarly for $Y_{t_j+1}$, so $w(X_{t_j+1}, S(t_j,2)) = w(Y_{t_j+1}, S(t_j,2))$.
Thus, the inductive claim has been proved.

Finally, we note that if $w(X_t, \{i\}) = w(Y_t, \{i\})$ for any singleton $\{i\}$, then $X_t[i] =
Y_t[i]$ for the sampler on the simplex (respectively $X_t[i,j] = Y_t[i,j]$ for $j \in \{1,2\}$ for
the other sampler). Since $P_T = \{\{1\}, \{2\}, \ldots, \{n\}\}$, this proves the lemma. ■

So, in both cases, showing that all subset couplings succeed with high probability
is sufficient to show that coupling has succeeded.



4. Contraction for Gibbs Samplers on the Simplex with Geometry

In this section, we prove a contraction lemma for Gibbs samplers on the simplex
associated with a group $G$ and symmetric generating set $R$ of $G$ (that is, $R = R^{-1}$),
where $|G| = n$, $|R| = m$, and id is the identity element of $G$. We recall briefly some
definitions. We write $\Delta_G = \{X \in \mathbb{R}^n \mid X[g] \ge 0,\ \sum_{g \in G} X[g] = 1\}$. If $X_t \in \Delta_G$ is a
copy of the Markov chain, we take a step by choosing $g \in G$, $r \in R$ and $\lambda \in [0,1]$
uniformly and setting $X_{t+1}[g] = \lambda(X_t[g] + X_t[gr])$, $X_{t+1}[gr] = (1-\lambda)(X_t[g] + X_t[gr])$,
and for all other entries $X_{t+1}[h] = X_t[h]$. This walk is closely related to a slow simple
random walk on the group. In particular, we let $Z_t \in G$ be the random walk that
evolves by choosing at each time step a group element $g \in G$ and generator $r \in R$
uniformly at random, and setting $Z_{t+1} = Z_t r$ if $Z_t = g$, and $Z_{t+1} = Z_t$ otherwise.

Let $K$ be the transition kernel associated with the random walk $Z_t$. Since $R$ is
symmetric, the random walk is reversible, so $K$ can be written in a basis of orthogonal
eigenvectors with real eigenvalues $1 = \lambda_1 > \lambda_2 \ge \ldots \ge \lambda_n > -1$. Since the walk is lazy
(it holds in place with probability at least $\frac{1}{2}$), all eigenvalues are in fact
nonnegative. Let $\gamma = 1 - \lambda_2$ be the spectral gap of $K$. In this section we will show that



Lemma 6 (Contraction Estimate for Gibbs Sampler on Cayley Graphs). Let $X_t$, $Y_t$
be two copies of the Gibbs sampler on the simplex associated with $G$ and $R$, with joint
distribution given by a proportional coupling at each step. Then



E[\\Xt-Yt\\l]<Ane-^*-ii 



Proof. We will construct an auxiliary Markov chain on $G$ associated with $X_t$, and
compare it to the standard random walk $Z_t$. Let $X_t$, $Y_t$ be two copies of the walk,
and couple them at each step with the proportional coupling. For $h \in G$, let
$S_t^h = \sum_{g \in G} (X_t[g] - Y_t[g])(X_t[hg] - Y_t[hg])$. We will analyze the evolution of the vector
$S_t = (S_t^h)_{h \in G}$.

There are three cases to analyze: $h \in R$, $h \notin R$ and $h \ne \mathrm{id}$, and $h = \mathrm{id}$. Let $\mathcal{F}_t$ be
the $\sigma$-algebra generated by $X_s$ and $Y_s$, $0 \le s \le t$. For case 1, we have

E[Sl,\Tt] = il-- + —)S^ 
n mn 

+ (XtH + Xt[i] - Yt[m] - Yt[7]){Xt[hm] - Yt[hm]) 
+ {Xt[h-H] - Yt[h-H]){Xt[i] + Xt[rt] - Yt[i] - Yt[rt]) 
+ {Xt[h-'rz] - Yt[h-'m]){Xt\i] + Xt[ri] - Yt\i] - Yt[ri])] 

\2 



ieG 

+ - Y^mXtM - Yt[hi\) + [XS] - Yt[i\){Xt[hH] - Yt[hH])] 

- + — )5f + — + — . 

n 2>mn 3mn mn 

2mn 



n omn Smn mn 



and we note that the sum of the coefficients is 1 — 5-^- For case 2, we have 

Smn ' 



A '?m 

E[S^^,\Tt]^il--)S'^+—S^ 

n mn 



+ i E(^^" + + + SI"'") 

r&R 

= (1 - -)s'^ + — + + s:'' + s:''~'' 



where the sum of the coefficients is 1. Finally, in case 3, we have 



mU^t] = (1 - l)Sl' + ^ E E(^*W + - Ytiil - YArz]f 

reR ieG 

= (1 - — )5f + — Ysi 
and here the sum of the coefficients is 1 + If we rewrite 1/^'^ = ^Sl'^, and otherwise
= Sf, then we find the following transformations. For case 1, we have 

(2) E[U^^,\T,] = (!-- + -^)U^ + + —Uf 

n 6mn Smn mn 



rh 



For case 2, we have 

--(1- , 

n 2mn 



rh 



(3) E[ui^_,,\T,] = (1 - l)u,' + ^ j](f/r' + + + u, 

Finally, in case 3, we have 

(4) Eiuiun^n-lm'^^X"' 

where the sum of the coefficients is now 1 in all three cases. In particular, the equa- 
tions ([2]) to Q now define a Markov chain on G. From equation (|2]), this random walk 
sends the identity to itself with probability 1 — t^, and to a uniformly chosen element 
of 5* with the remaining probability; Equations ^ and Q describe transitions for 
h & R and h ^ R respectively. Call the transition kernel K. 

Before analyzing the chain, we note that J^ieai-^A'^] ~ ^tW) = 0; so 

0= {J2^xM-YM 

From this calculation, if {v, (2, 1, 1, . . . , 1)) = 0, then {Kv, (2, 1,1,...,!)) = as 
well. By direct computation, vr = ^^(2, 1, 1, . . . , 1) is a reversible measure for K. It 

is also clear that the distribution tt = ^(1, 1, . . . , 1) is the reversible measure for K. 

We are now ready to compare the chains. Recall from |10j that the Dirichlet form 
associated to a Markov chain with transition kernel Q and stationary distribution u 
is given by 

h,geG 

Let S and S be the Dirichlet forms associated with K and K respectively. Then by 
comparing terms, it is clear that S{(j)) > for any and |, ^ < 2. By e.g. 

Lemma 13.12 of |l5j, this implies 7 > ^7. 
Recall that if (f , vr) = 0, then {K"^v, vr) = as well. In particular K applied to the
subspace orthogonal to vr has — t- operator norm at most 1 — 7. Thus, we have 
for any v in that subspace 

\\K"'v\\2 < e-L^™J||t;||2 

going back to our original situation, we are interested in the vector (Sf). At time 
0, •S'o'' < 2, and by Cauchy-Schwarz |S'q| < 4. Thus, ||?7o||2 < 4n, and of course 
\Si'^\ < \\S?\\2 < ||f/f lb- So we find that 

E[\Sf\] < 4ne-L'?J 

which is the contraction estimate in Lemma [H ■ 



5. Coupling for Gibbs Samplers on the Simplex with Geometry 

Having shown contraction, we must now show convergence in total variation
distance. First, the analogue to Lemma 4:

Lemma 7 (Connectedness for Gibbs Sampler on Cayley Graphs). Let $\tau$ be as defined
immediately before Lemma 4 and let $\gamma$ be as defined immediately before Theorem 1.
Then for t > gi^Z+^^^^lM^ have 

P[T>t]< 2n'^ 

Proof. We consider a graph-valued process $G_t$, where $G_0$ is a graph with no edges,
and vertex set equal to the group $G$. To construct $G_{t+1}$ from $G_t$, choose elements
$g \in G$ and $r \in R$ uniformly at random, and add the edge $(g, gr)$ if it isn't already
in $G_t$. We note that $\tau > t$ if and only if $G_t$ is not connected, so we would like to
estimate the time at which $G_t$ becomes connected.
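The graph-valued process is easy to simulate, which can be useful for sanity-checking connection-time bounds on small examples. The sketch below assumes the cyclic group $\mathbb{Z}_n$ in additive notation and tracks components with a union-find structure; the function name and the choice of group are illustrative and not from the paper.

```python
import random


def connection_time(n, generators, rng):
    """Number of random edges (g, g + r) drawn until the graph on Z_n is connected.

    generators -- symmetric generating set of Z_n (the walk must be able to
                  connect the whole group, otherwise the loop never ends)
    rng        -- a random.Random instance
    """
    parent = list(range(n))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = n
    t = 0
    while components > 1:
        g = rng.randrange(n)
        r = rng.choice(generators)
        a, b = find(g), find((g + r) % n)
        t += 1
        if a != b:                # the new edge joins two components
            parent[a] = b
            components -= 1
    return t
```

Averaging `connection_time` over many seeds gives an empirical estimate of the time at which $G_t$ becomes connected.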

First, fix two elements $x, y \in G$. We'd like to see if $x, y$ are in the same component
of $G_t$. To do so, let $X_t$, $Y_t$ be two copies of the Gibbs sampler described in the last
section, with $X_0 = x$, $Y_0 = y$. Couple $X_t$, $Y_t$ and $G_t$ by the proportional coupling.
Then assume $x, y$ are in different components $C_x$, $C_y$ at time $t$. We would have

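$$\sum_{g} |X_t[g] - Y_t[g]|^2 = \sum_{g \in C_x} X_t[g]^2 + \sum_{g \in C_y} Y_t[g]^2 \ge \frac{4}{n}$$

By Markov's inequality, then,

$$P[C_x \ne C_y] \le \frac{n}{4} E\Big[\sum_{g} |X_t[g] - Y_t[g]|^2\Big]$$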
g 
and so, by a standard union bound for fixed $x$ over all $y$, if $A_t$ is the event that $G_t$ is
disconnected,



P[At] < -snpE[J2\Xt-Yt\^\Xo = fi,Yo = iy] 
4 g 

where the last inequality is due to Lemma 6. ■

Next, we define subset couplings and discuss success probabilities for this walk.
Fix points $X_t$, $Y_t$, subset $S \subset [n]$ and updated coordinates $i = i(t) \in S$, $j = j(t) \notin S$.
The next step is to construct a pair of uniform random variables $\lambda_x = \lambda(t)^x$ and
$\lambda_y = \lambda(t)^y$ with which to update the chains $X_t$ and $Y_t$ respectively. Assume first that
Y-*[l]+y/[j]^ < 1) choose Xy uniformly in [0, 1]. Then set 

if that results in a value between and 1. Otherwise, choose A^^ independently of Ay, 
according to the density: 

(6) /(a)=c(i-||±||!i,„„m(A)~ 
where = /„ f{X)d\ is a normalizing constant, and 

From the assumption that $\frac{Y_t[i]+Y_t[j]}{X_t[i]+X_t[j]} \le 1$, it is easy to see that $f$ really is a density
on $[0,1]$. From its construction as a remainder density, it is easy to check that under
this coupling, $\lambda_x$ is uniformly distributed on $[0,1]$. If $\frac{Y_t[i]+Y_t[j]}{X_t[i]+X_t[j]} > 1$, an analogous
construction will work. More precisely, in this case choose $\lambda_x$ first, and then choose
$\lambda_y$ to satisfy equation (5) if the result is in $[0,1]$, rather than choosing $\lambda_y$ first. If the
result is not in $[0,1]$, then choose $\lambda_y$ according to its remainder measure, given by
equation (6) with $X_t$ and $Y_t$ flipped and $g$ replaced by $g^{-1}$. Note that if equation (5) is
satisfied, then $w(S, X_{t+1}) = w(S, Y_{t+1})$.

For a pair of points $(x, y)$ in the simplex, a pair of update entries $(i, j)$, and a subset
$S \subset [n]$ of interest such that $i \in S$ and $j \notin S$, we define $p(x, y, i, j, S)$ to be the
probability that the associated subset coupling succeeds. Then the following lemma
from [26] gives a lower bound on this probability:
Lemma 8 (Subset Coupling). For a pair of vectors {x,y) satisfying supj \xi — yi\ <
n~'^ and infj Xj, infj ?/j > , for e > b, we have for all sufficiently large n that 
p{x, y, i,j, S) > 1 — 2n^^^^'^ uniformly in S and possible i,j. 

In general, it is possible to choose $x, y, i, j, S$ so that the probability of success
is 0 under any coupling, and the lemma is quite restrictive. Having bounded the
probability of failure when $X_t$, $Y_t$ are close, we must show that they remain close with
high probability. Define for $v \in \mathbb{R}^n$ and $S \subset [n]$ the quantity $\|v\|_{1,S} = \sum_{s \in S} |v[s]|$.
Then:

Lemma 9 (Smallness). Let $X_t$, $Y_t$ be coupled as described in Section 3 and assume
that $P_{T_1} = \{[n]\}$, that all subset couplings up to time $t$ have succeeded, and that
$\|X_{T_1} - Y_{T_1}\|_1 \le \epsilon$. Then $\|X_t - Y_t\|_{1,S} \le \epsilon$ for every $S$ in $P_t$.

Proof. There are two types of coupling to take care of. For a proportional coupling
between $i$ and $j$, we note that the error $\Delta$ satisfies:

$$\begin{aligned}
\Delta &= |X_{t+1}[i] - Y_{t+1}[i]| + |X_{t+1}[j] - Y_{t+1}[j]| \\
&= \lambda(t)\,|X_t[i] + X_t[j] - Y_t[i] - Y_t[j]| + (1 - \lambda(t))\,|X_t[i] + X_t[j] - Y_t[i] - Y_t[j]| \\
&\le |X_t[i] - Y_t[i]| + |X_t[j] - Y_t[j]|
\end{aligned}$$

Since $i$ and $j$ always connect elements of the same set in $P_t$, this shows that
proportional couplings never increase $\|X_t - Y_t\|_{1,S}$.
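The inequality for proportional couplings can be checked numerically. The sketch below applies the same triple $(i, j, \lambda)$ to two points of the simplex and compares the $L^1$ distance before and after; the helper names and the way random simplex points are generated (normalized exponential variables) are assumptions made for this illustration.

```python
import random


def random_simplex_point(n, rng):
    """A random point of the simplex, obtained by normalizing exponentials."""
    weights = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def proportional_step(x, y, i, j, lam):
    """Update both chains with the same coordinates (i, j) and the same lambda."""
    def update(v):
        w = list(v)
        total = v[i] + v[j]
        w[i] = lam * total
        w[j] = (1.0 - lam) * total
        return w
    return update(x), update(y)


def l1(x, y):
    """L^1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

After a proportional step the two updated coordinates contribute exactly $|X_t[i] + X_t[j] - Y_t[i] - Y_t[j]|$ to the distance, which is why the distance cannot grow.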

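Otherwise, assume that at time $t$ we had a successful subset coupling between
subsets $S(1)$, $S(2)$ along edge $i, j$, with $i$ in $S(1)$ and $j$ in $S(2)$. Then we note that

$$\begin{aligned}
X_{t+1}[i] - Y_{t+1}[i] &= \sum_{s \in S(1) \setminus \{i\}} (Y_t[s] - X_t[s]) \\
&= X_t[i] - Y_t[i] + \sum_{s \in S(2)} (X_t[s] - Y_t[s])
\end{aligned}$$

where the second line uses the fact that the weights of $X_t$ and $Y_t$ agree on $S(1) \cup S(2)$,
and so

$$|X_{t+1}[i] - Y_{t+1}[i]| \le |X_t[i] - Y_t[i]| + \|X_t - Y_t\|_{1,S(2)}$$

which immediately implies that

$$\|X_{t+1} - Y_{t+1}\|_{1,S(1)} \le \|X_t - Y_t\|_{1,S(1) \cup S(2)}$$

and similarly for $S(2)$. Inductively, this shows that $\|X_{t+1} - Y_{t+1}\|_{1,S} \le \|X_0 - Y_0\|_1$. ■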

Related to this, the following lemma from chapter 13 of jl] shows that Xt, Yt rarely 
have entries close to 0: 

Lemma 10 (Largeness). P[infi<j<„ ■Yiii^<t<TYt^\ < n-'] <Tn'~^ 
This lets us complete the calculation. Assume that the initial contractive phase
is of length Ti = ^log(?2), and that the second coupling phase is of length T2 = 



8C2 



^ log(n). 

By Lemma 6, ElJ^g^ci-^Tilg] — ^iM)^] < 4n^~*"i, and combining this with 
Markov's inequality and the bound ||V"||i < -\/2n||V||2, we have P[J2geG\^T-i[g] — 
yn[g] \ > n-"] < n^+«-i^i. 

By Lemma 10, P[info<f<2i,3eG — ""^^^l — By Lemma 7, the probability 

that Pjp-^ consists of a single block is at least 1 — 2n^^'-^^. Finally, by lemmas 8 and 9 
the probability that any of the subset couplings fail while info<t<5,c,eG '^tid] > and 
YlgeG \-^Ti [g] ~Yti [g] I ^ " is less than 2n^+^^". By Lemma 5, = unless one of 
the subset couplings fails or r > T2. As written, the sum of the probabilities that one 

of these two events don't occur is thus at most 8n"'^"^"~2'^i _j_ i^3-b _j_ 2^^3-02 -)-2n*+2~'^. 

7 

It is easy to show that 7 > ^ for simple random walk on any Cay ley graph (e.g. by 
naive bounds with Theorem 13.14 of (l5]), which changes the second term to 2rJ~^. 
To come close to minimizing this, if T = 8(7a; + 23)n log(r;,), set 6 = a; + 7, a = 2x + 9, 
Ci = Qx + 20, and X2 = x + 3. For this, find that the probability of failure is at most 
14n-^'. 

Thus, we have shown that for the simplex walk, for t > 8{7x + 23) '"^"^ — 8^^^!^, 

(7) \\CiXt)-CiYt)\\Tv<Un-^ 
which proves the upper bound in Theorem 1. 

6. Lower Bounds for Gibbs Samplers on the Simplex with Geometry

In this section, we prove lower bounds on the mixing time of the Gibbs sampler on
the simplex. The results are similar to those of [22], though the method is different
and elementary. Begin by calculating

$$\begin{aligned}
E[X_{t+1}[g] \mid X_t] &= \Big(1 - \frac{2}{n}\Big) X_t[g] + \frac{2}{mn} \sum_{r \in R} \frac{1}{2}\,(X_t[g] + X_t[gr]) \\
&= \Big(1 - \frac{1}{n}\Big) X_t[g] + \frac{1}{mn} \sum_{r \in R} X_t[gr]
\end{aligned}$$

In particular, let $K$ be the transition matrix on $G$ given by $K[g,g] = 1 - \frac{1}{n}$, and
$K[g,gr] = \frac{1}{mn}$ for $r \in R$. This is the standard 'edge'-based random walk on $G$ with
generating set $R$ described above. This calculation shows that $E[X_t] = K^t X_0$. Note
that this is not the same $K$ as was used earlier in the section while proving the upper
bound on the mixing time. By the earlier assumptions on $R$, $K$ is reversible with
respect to the uniform measure on $G$. Furthermore, it is orthogonally diagonalizable
with real eigenvalues $1 = \beta_1 > \beta_2 \ge \cdots \ge \beta_n \ge 0$.
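To make the kernel and its spectral gap concrete, the following sketch builds the slow walk of Section 4 on the cyclic group $\mathbb{Z}_n$ (hold with probability $1 - \frac{1}{n}$, otherwise multiply by a uniform generator) and estimates its second eigenvalue by power iteration on the complement of the constant vectors. The group, the function names, and the use of power iteration are assumptions for illustration; the iteration relies on the kernel being symmetric with nonnegative eigenvalues.

```python
import math


def slow_walk_kernel(n, generators):
    """Kernel on Z_n: hold w.p. 1 - 1/n, else step by a uniform generator."""
    m = len(generators)
    K = [[0.0] * n for _ in range(n)]
    for g in range(n):
        K[g][g] += 1.0 - 1.0 / n
        for r in generators:
            K[g][(g + r) % n] += 1.0 / (m * n)
    return K


def second_eigenvalue(K, iters=2000):
    """Second-largest eigenvalue of a symmetric kernel with nonnegative spectrum."""
    n = len(K)
    u = [float(i) for i in range(n)]   # generic starting vector
    rayleigh = 0.0
    for _ in range(iters):
        mean = sum(u) / n
        u = [a - mean for a in u]      # project away the constant eigenvector
        norm = math.sqrt(sum(a * a for a in u))
        u = [a / norm for a in u]
        Ku = [sum(K[i][j] * u[j] for j in range(n)) for i in range(n)]
        rayleigh = sum(a * b for a, b in zip(u, Ku))  # u . K u with |u| = 1
        u = Ku
    return rayleigh


def spectral_gap(K):
    """Spectral gap 1 - beta_2 of the kernel."""
    return 1.0 - second_eigenvalue(K)
```

For $\mathbb{Z}_n$ with generators $\{1, -1\}$ the eigenvalues are $1 - \frac{1}{n}(1 - \cos(2\pi k/n))$, which gives a closed form against which the numerical value can be compared.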
Next, let $v$ be an eigenvector of $K$ with eigenvalue $\beta_2$, normalized so that $\|v\|_2 = 1$
and $\|v - Kv\|_2 = \gamma$, the spectral gap of $K$. Let $\Pi$ be the collection of vectors with
nonnegative entries summing to 1, and let $w \in \Pi$ maximize the inner product $\langle v, w \rangle$
among such vectors; such a vector exists by the compactness of $\Pi$. Let $X_t$ be a copy
of the Markov chain begun from $X_0 = w$; then $E[\langle X_t, v \rangle] = (1 - \gamma)^t \langle w, v \rangle$. On the
other hand, if $A_{t,d}$ is the event that $\langle X_t, v \rangle > d$,

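$$\begin{aligned}
E[\langle X_t, v \rangle] &= E[\langle X_t, v \rangle 1_{A_{t,d}}] + E[\langle X_t, v \rangle 1_{A_{t,d}^c}] \\
&\le \langle X_0, v \rangle P[A_{t,d}] + d
\end{aligned}$$

where the last step takes advantage of the maximality of $X_0$. Thus,

$$P[A_{t,d}] \ge \frac{E[\langle X_t, v \rangle] - d}{\langle X_0, v \rangle}$$

and, putting the two inequalities together,

$$P[A_{t,d}] \ge (1 - \gamma)^t - \frac{d}{\langle X_0, v \rangle}$$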


The next step is to prove that {Xq.v) > Let P C [n] be the collection of 

indices so that v[p] > for p G P. Without loss of generality, assume XlpeP'^p ^ ^■ 
Now set = X^pgp Vp < ^/n. Then consider the distribution given by /ip = Xvp for 
p E P, and fjLp — ior p ^ P: 

(/X, v) = \^vl 
^ 2 

and so 

P[AA > (1 - 7)* - IdV^ 

Now, let Y G A(3 be chosen according to the uniform distribution. Then E[{Y, v)] = 
0, and 

E[{Y,v)']=E[{J2y[^]v^?] 

ieG 

= j2v^E[Y\ir]+ j2 v.vjE[Ymj]] 

ieG i¥=jeG 
2 
So, by Chebyshev's inequality, P[(F, f) > d] < -^j^- Putting this together with the
inequahty above, letting d = n~e and defining P[Aoo,d] = limj^oo 



P[At,d] - P[Aoo,d] > 7* - 3n 3 
And the lower bound follows immediately. 

7. Contraction and Narrow Matrices 

We begin with some quick observations about the geometry of our space. It is the
part of an $(n-1)$-dimensional affine subspace of $\mathbb{R}^{2n}$ that lies in the upper orthant.
Our updates are in fact moves along 1-dimensional pieces of this subspace, even
though we are updating four entries. While the original motivation for this sampler
comes from statistics (see e.g. [6]), it is being treated here primarily as an example of
a chain that is somewhere between the standard Gibbs sampler on the simplex and
an analogous Gibbs sampler on doubly-stochastic matrices or Kac's famous walk on
the orthogonal group. The former was analyzed by the author in [26], using a simpler
non-Markovian coupling argument. Matching bounds on total variation mixing time
are not known for either the Gibbs sampler on doubly-stochastic matrices or Kac's
walk. The best such bounds to date can be found in [27] and [14] respectively. Both
mixing bounds are polynomials with small but probably incorrect degrees, and both
are based on much more complicated non-Markovian coupling arguments.

In this section, we will prove contractivity estimates for the Gibbs sampler on
narrow matrices. The work will be done in a combination of the $L^2$ metric,
$\|X_t - Y_t\|_2^2 = \sum_{i=1}^n (X_t[i,1] - Y_t[i,1])^2$, and the $L^1$ metric,
$\|X_t - Y_t\|_1 = \sum_{i=1}^n |X_t[i,1] - Y_t[i,1]|$. The following main estimate will be proved by a
sequence of lemmas:

Lemma 11 (Weak Convergence on Narrow Matrices). If $X_t$ and $Y_t$ are coupled under
the proportional coupling until time $T_1 = (10k + 10.5)\, n \log(n)$, then

P[\\Xt,-Yt,\\i >e]< 3e-^n-^ 

We begin with a non-rigorous description of the proof strategy for this lemma,
which takes the rest of this section. The lemma is a contraction result, and it will
be proved using a variant of the path-coupling argument introduced in [i]. In
path-coupling arguments, the goal is to couple $X_t$ and $Y_t$ by constructing an interpolating
chain, $X_t = Z_t^{(0)}, Z_t^{(1)}, \ldots, Z_t^{(m)} = Y_t$ so that $d(X_0, Y_0) \approx \sum_{j=1}^{m} d(Z_0^{(j-1)}, Z_0^{(j)})$ for some
metric $d$. We would then show that, in general, $E[d(Z_t^{(j-1)}, Z_t^{(j)})] \le \alpha^t d(Z_0^{(j-1)}, Z_0^{(j)})$
for some $0 < \alpha < 1$. In most coupling arguments, we find such an $\alpha$ that holds for
all pairs $Z_t^{(j)}, Z_t^{(j+1)}$ associated with a typical pair $X_t$ and $Y_t$; this immediately gives
an estimate of $E[d(X_t, Y_t)] \le \alpha^t \sum_{j} d(Z_0^{(j-1)}, Z_0^{(j)}) \approx \alpha^t d(X_0, Y_0)$ for most starting
pairs $X_0, Y_0$. This is converted into a bound on all starting pairs by adding a small
period, known as the 'burn-in', to remove any bad features of $X_0$ and $Y_0$ with high
probability. We will also need a burn-in period, and divide $T$ into an initial burn-in
of length $T_1$ and a contractive period of length $2T_2$.

In our argument, we show a contraction estimate only for most pairs $Z_t^{(j-1)}, Z_t^{(j)}$ in
an interpolation between typical chains $X_t$, $Y_t$. While this generally causes arguments
for chains over finite spaces to fail completely, it leads to only slightly worse bounds
for this and (we conjecture) other chains on continuous spaces. To be more precise,
Lemma 12 gives an contraction coefficient of (1 — ^) for sufficiently nearby points. 
On the other hand, inequality (13) indicates that under the proportional coupling,
E\d{X2t-,y2t)\ < C'(n)(l — J^)* for some constant C(n). In particular, the global 
contraction estimate and the local contraction estimate asymptotically differ by a 
factor of only 2. 

Having discussed the big picture, we will now begin the proof by making some basic
remarks about the chain, beginning with an alternative description of the transition
probabilities. Define $\delta_t[i,j] = 2 - X_t[i,1] - X_t[j,1]$, and $\epsilon_t[i,j] = 2 - Y_t[i,1] - Y_t[j,1]$.
Then a step of the chain can be defined in the following way. Choose $i, j$ as before,
and choose $\lambda \sim U[0,1]$. If $\delta_t[i,j] \ge 0$, then we update according to:

(8)
$$\begin{aligned}
X_{t+1}[i,1] &= \lambda\,(2 - \delta_t[i,j]) \\
X_{t+1}[j,1] &= (1-\lambda)\,(2 - \delta_t[i,j]) \\
X_{t+1}[i,2] &= 2(1-\lambda) + \lambda\,\delta_t[i,j] \\
X_{t+1}[j,2] &= 2\lambda + (1-\lambda)\,\delta_t[i,j]
\end{aligned}$$

If $\delta_t[i,j] < 0$, we update according to:

(9)
$$\begin{aligned}
X_{t+1}[i,1] &= 2\lambda - (1-\lambda)\,\delta_t[i,j] \\
X_{t+1}[j,1] &= 2(1-\lambda) - \lambda\,\delta_t[i,j] \\
X_{t+1}[i,2] &= (1-\lambda)\,(2 + \delta_t[i,j]) \\
X_{t+1}[j,2] &= \lambda\,(2 + \delta_t[i,j])
\end{aligned}$$

Note that in both cases, a larger value of $\lambda$ means a larger value of $X_{t+1}[i,1]$. We
are now ready to describe the proportional coupling: as in the simplex case, we choose
the same value of $\lambda$ for both chains in the above representation.
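The two update rules (8) and (9) translate directly into code. The sketch below stores the matrix as a list of $n$ rows of length 2 with 0-based indices; the function name `narrow_step` and this data layout are assumptions made for the example.

```python
import random


def narrow_step(X, i, j, lam):
    """One Gibbs step on an n-by-2 matrix with row sums 2, updating rows i and j.

    X   -- list of rows [first column entry, second column entry]
    lam -- the uniform variable lambda in [0, 1]
    Returns a new matrix; X itself is not modified.
    """
    delta = 2.0 - X[i][0] - X[j][0]
    Y = [row[:] for row in X]
    if delta >= 0:
        # Case (8): the two first-column entries sum to at most 2.
        Y[i][0] = lam * (2.0 - delta)
        Y[j][0] = (1.0 - lam) * (2.0 - delta)
        Y[i][1] = 2.0 * (1.0 - lam) + lam * delta
        Y[j][1] = 2.0 * lam + (1.0 - lam) * delta
    else:
        # Case (9): the two first-column entries sum to more than 2.
        Y[i][0] = 2.0 * lam - (1.0 - lam) * delta
        Y[j][0] = 2.0 * (1.0 - lam) - lam * delta
        Y[i][1] = (1.0 - lam) * (2.0 + delta)
        Y[j][1] = lam * (2.0 + delta)
    return Y
```

In both branches the row sums stay equal to 2 and the sum of the two updated entries in each column is unchanged, so the chain stays in $M_n$. Calling `narrow_step` on two matrices with the same `i`, `j` and `lam` gives the proportional coupling.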

We will need two initial contractivity lemmas, dealing with contractivity for
nearby points and a general but poor bound in $L^1$. The first is:

Lemma 12 ($L^2$ Contractivity). If $X_t$, $Y_t$ are coupled under the proportional coupling
for time $0 \le t \le T_1$, and $1_A$ is the indicator function of the event that
$\delta_t[i,j]\,\epsilon_t[i,j] \ge 0$ for all $1 \le i, j \le n$ and all $0 \le t \le T_1$, then

E[\\Xt, - Yt.WIIa] < (1 - ^f'Uo - YoWl 
Proof. We begin the proof by calculating the change in the norm during a single
move. Let $F_t[i,j]$ be the event that coordinates $i, j$ are updated at time $t$. We find:



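\[
\begin{aligned}
\Delta_t &= E\big[(X_{t+1}[i,1] - Y_{t+1}[i,1])^2 + (X_{t+1}[j,1] - Y_{t+1}[j,1])^2 \\
&\qquad + (X_{t+1}[i,2] - Y_{t+1}[i,2])^2 + (X_{t+1}[j,2] - Y_{t+1}[j,2])^2 \,\big|\, F_t[i,j]\big] \\
&= 2E\big[(\lambda(2 - \delta_t[i,j]) - \lambda(2 - \epsilon_t[i,j]))^2 + (\lambda\delta_t[i,j] - \lambda\epsilon_t[i,j])^2 \,\big|\, F_t[i,j]\big] \\
&= \frac{4}{3}\,(\delta_t[i,j] - \epsilon_t[i,j])^2 .
\end{aligned}
\]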



It would be nice to calculate the sums of terms like $(\epsilon_t[i,j] - \delta_t[i,j])^2$ in terms of
sums of terms like $(X_t[i,j] - Y_t[i,j])^2$. Fortunately, as in the simplex case, these are
easy to relate. We first note that



\[
0 = \Big(\sum_{i=1}^{n} (X_t[i,1] - Y_t[i,1])\Big)^2
= \sum_{i=1}^{n} (X_t[i,1] - Y_t[i,1])^2 + 2\sum_{i<j} (X_t[i,1] - Y_t[i,1])(X_t[j,1] - Y_t[j,1]) .
\]



If $\delta_t[i,j]\,\epsilon_t[i,j] > 0$ for all $i,j$, we can write the first line of the following computation:



\[
\begin{aligned}
\sum_{i \ne j} (\delta_t[i,j] - \epsilon_t[i,j])^2
&= \sum_{i \ne j} (X_t[i,1] + X_t[j,1] - Y_t[i,1] - Y_t[j,1])^2 \\
&= \sum_{i \ne j} \big[(X_t[i,1] - Y_t[i,1])^2 + (X_t[j,1] - Y_t[j,1])^2 \\
&\qquad\quad + 2(X_t[i,1] - Y_t[i,1])(X_t[j,1] - Y_t[j,1])\big] \\
&= (n-2)\sum_{j=1}^{2}\sum_{i=1}^{n} (X_t[i,j] - Y_t[i,j])^2 .
\end{aligned}
\tag{10}
\]

Then the final contraction is given by:

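\[
\begin{aligned}
\Delta_t &= E\big[\|X_{t+1} - Y_{t+1}\|_2^2 \,\big|\, X_t, Y_t\big] \\
&= \frac{1}{n(n-1)} \sum_{k=1}^{2}\sum_{m=1}^{n}\sum_{i \ne j} E\big[(X_{t+1}[m,k] - Y_{t+1}[m,k])^2 \,\big|\, F_t[i,j]\big] \\
&= \frac{1}{n(n-1)} \Big[\sum_{k=1}^{2}\sum_{m=1}^{n}\sum_{i,j \ne m} (X_t[m,k] - Y_t[m,k])^2 + \frac{4}{3}\sum_{i \ne j} (\delta_t[i,j] - \epsilon_t[i,j])^2\Big] \\
&\le \Big(1 - \frac{2}{3n}\Big)\|X_t - Y_t\|_2^2 ,
\end{aligned}
\]

where the last line is due to equation (10). On the set $A$, iterating this inequality
over $t$ immediately implies the estimate in the statement of the lemma. On $A^c$, the
expectation is of course exactly 0. ■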
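
The two algebraic facts used in this proof can be checked numerically. The sketch below assumes the same-sign case, in which an update of the pair $(i,j)$ replaces the contribution of columns $i$ and $j$ to the squared norm by $\frac{4}{3}(\delta_t[i,j] - \epsilon_t[i,j])^2$. It checks the identity $\sum_{i \ne j} (\delta_t[i,j] - \epsilon_t[i,j])^2 = (n-2)\|X_t - Y_t\|_2^2$ and computes the exact one-step factor obtained by averaging over ordered pairs. All function names, and the closed form in `one_step_factor`, are worked out for this example and are not taken from the text.

```python
import random
from fractions import Fraction

def random_table(n, rng):
    """A 2-by-n table as a list of pairs [top, bottom]; tops lie in (0.2, 1.8) and sum to n."""
    u = [rng.uniform(-0.4, 0.4) for _ in range(n)]
    mean = sum(u) / n
    return [[1.0 + ui - mean, 1.0 - ui + mean] for ui in u]

def sq_l2(X, Y):
    """Squared entrywise L^2 distance between two tables."""
    return sum((X[m][k] - Y[m][k]) ** 2 for m in range(len(X)) for k in range(2))

def pair_sum(X, Y):
    """Sum over ordered pairs i != j of (delta[i,j] - epsilon[i,j])^2."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                delta = 2 - X[i][0] - X[j][0]
                eps = 2 - Y[i][0] - Y[j][0]
                total += (delta - eps) ** 2
    return total

def expected_next_sq(X, Y):
    """Expected squared distance after one coupled step, same-sign case assumed."""
    n = len(X)
    a = [X[m][0] - Y[m][0] for m in range(n)]
    base = sq_l2(X, Y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # columns i, j lose 2a_i^2 + 2a_j^2 and gain (4/3)(a_i + a_j)^2
                total += base - 2 * a[i] ** 2 - 2 * a[j] ** 2 + (4.0 / 3.0) * (a[i] + a[j]) ** 2
    return total / (n * (n - 1))

def one_step_factor(n):
    """Exact ratio expected_next_sq / sq_l2 as a Fraction (derived for this sketch)."""
    n = Fraction(n)
    return ((n - 1) * (n - 2) + Fraction(4, 3) * (n - 2)) / (n * (n - 1))
```

For example, `one_step_factor(3)` is `Fraction(5, 9)`, which is below $1 - \frac{2}{3 \cdot 3} = \frac{7}{9}$.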

Lemma 13 ($L^1$ Contractivity). If $X_t$, $Y_t$ are coupled under the proportional coupling
for time $0 \le t \le T_1$, then

\[
\|X_{T_1} - Y_{T_1}\|_1 \le \|X_0 - Y_0\|_1 . \tag{11}
\]

Proof. Considering the cases $\delta_t[i,j]\,\epsilon_t[i,j] > 0$ and $\delta_t[i,j]\,\epsilon_t[i,j] < 0$, this follows immediately
by induction on $t$ from applying the triangle inequality to the formulae (8)
and (9). ■

Having demonstrated contractivity for 'nice' pairs Xt and Yt, we must now look at 
'typical' pairs Xt and Yt. The following burn-in lemma shows that, after a moderate 
number of steps, Xt and Yt are unlikely to be too close to the boundary of our convex 
set. 

Lemma 14 (Burn-in). For any starting position $X_0$,

P[miXt[i,j] < n-%P[snpXt[i,j] > 2 - n'''] < n~7?u4^+6-5 ^ 

Proof. Our proof will be via comparison to a Gibbs sampler on the simplex, studied
by the author in [26]. Let $X_t$ be a copy of our Gibbs sampler on 2 by $n$ matrices, and
let $S_t$ be a Gibbs sampler on the simplex $\Delta_n = \{S \in \mathbb{R}^n \mid S[i] \ge 0, \sum_{i=1}^{n} S[i] = 1\}$.
To make a move in this Gibbs sampler, choose distinct coordinates $1 \le i, j \le n$ and
$0 \le \lambda \le 1$ uniformly at random, and update entry $S_t[i]$ to $\lambda(S_t[i] + S_t[j])$ and entry
$S_t[j]$ to $(1 - \lambda)(S_t[i] + S_t[j])$, keeping all other entries fixed. This is identical to the
other sampler in this note, with generating set $R = G \setminus \{\mathrm{id}\}$.

Since $\sum_i S_0[i] = 1$, for any given $X_0$ it is possible to choose a corresponding $S_0$
such that $X_0[i,1] \ge S_0[i]$ for all $i$, without the row sum condition interfering. Next,
under our descriptions there is a natural proportional coupling of $X_t$ and $S_t$, given
by always choosing $i,j$ and $\lambda$ to be the same. We claim that under this coupling,
$X_t[i,1] \ge S_t[i]$ for all $t \ge 0$ and all $1 \le i \le n$. Assume inductively that this holds
until time $t$, and that coordinates $i,j$ are updated at time $t$. Using the representation
in (8) and (9),

\[
\begin{aligned}
X_{t+1}[i,1] &\ge \lambda \min(X_t[i,1] + X_t[j,1],\, 2) \\
&\ge \lambda \min(S_t[i] + S_t[j],\, 2) \\
&= \lambda (S_t[i] + S_t[j]) .
\end{aligned}
\]

Let $S$ be drawn from the uniform distribution on the simplex. Then the above
monotonicity tells us that

\[
P\big[\inf_i X_t[i,1] < n^{-k}\big] \le \|\mathcal{L}(S_t) - U\|_{TV} + P\big[\inf_i S[i] < n^{-k}\big] .
\]



From [26],



\\C{St) - U\\tv < n~^^W+^-^ and P[mUS\i] < n-''] < This 

gives a good bound on $P[\inf_i X_t[i,1] < n^{-k}]$. Since all rows sum to 2, this gives the
same bound on $P[\sup_i X_t[i,2] > 2 - n^{-k}]$. Since there is symmetry between the top
and bottom rows, this completes the proof. ■
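
The monotone comparison in this proof can be illustrated by running the two chains with shared randomness. The sketch below is an assumption-laden example: it tracks only the first entries $X_t[i,1]$ (the second entries are determined by them), starts the simplex chain from $S_0[i] = X_0[i,1]/n$ as one admissible choice, and uses the hypothetical name `min_gap_under_coupling`.

```python
import random

def min_gap_under_coupling(n, steps, seed=0):
    """Run X_t[., 1] and the simplex chain S_t with the same (i, j, lambda) at each step.

    Returns the smallest value of X_t[i,1] - S_t[i] seen over all times and coordinates.
    """
    rng = random.Random(seed)
    u = [rng.uniform(-0.4, 0.4) for _ in range(n)]
    mean = sum(u) / n
    top = [1.0 + ui - mean for ui in u]   # first entries: in (0.2, 1.8), summing to n
    S = [x / n for x in top]              # S_0[i] <= X_0[i,1] and sum(S_0) = 1
    worst = min(x - s for x, s in zip(top, S))
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        lam = rng.random()
        delta = 2.0 - top[i] - top[j]
        if delta > 0:   # first-entry part of update (8)
            top[i], top[j] = lam * (2 - delta), (1 - lam) * (2 - delta)
        else:           # first-entry part of update (9)
            top[i], top[j] = 2 * lam - (1 - lam) * delta, 2 * (1 - lam) - lam * delta
        s = S[i] + S[j]
        S[i], S[j] = lam * s, (1 - lam) * s   # simplex Gibbs move
        worst = min(worst, min(x - y for x, y in zip(top, S)))
    return worst
```

Up to floating-point rounding, the returned gap should never be negative, which is the claim $X_t[i,1] \ge S_t[i]$ for all $t$ and $i$.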



To analyze the second part of the coupling used to prove Lemma 11, we create an
interpolating sequence between $X_{T_1}$ and $Y_{T_1}$, for some $T_1 = c_0 n \log(n)$ large enough
that the burn-in lemma has taken effect. Since our sample space is convex, this will
be simple. We define $X_{T_1} = Z^1_{T_1}, Z^2_{T_1}, \ldots, Z^l_{T_1} = Y_{T_1}$ so that $\|Z^m_{T_1} - Z^{m+1}_{T_1}\|_1$ is very
small, and so that all of the $Z^m_{T_1}$ are in order along the line between $X_{T_1}$ and $Y_{T_1}$.
That is, $Z^i_{T_1}$ is closer to $X_{T_1}$ than $Z^j_{T_1}$ if $i < j$. For the following lemma and later use
in this paper, we define $\delta^m_t[i,j] = 2 - Z^m_t[i,1] - Z^m_t[j,1]$, analogously to the definition
of $\delta_t$ in the preliminary calculations. We also define
$D[Z^m_t, Z^{m+1}_t] = |\{(i,j) : \delta^m_t[i,j]\,\delta^{m+1}_t[i,j] < 0\}|$. Then:

Lemma 15 (Interpolating Sequence). The interpolating sequence described above
satisfies:

(1) $|\{m : D[Z^m_{T_1}, Z^{m+1}_{T_1}] \ge 1\}| \le n^2$

(2) $\min(X_{T_1}[i,j], Y_{T_1}[i,j]) \le Z^m_{T_1}[i,j] \le \max(X_{T_1}[i,j], Y_{T_1}[i,j])$ for all $i,j,m$.

Proof. Part (2) follows from the fact that $Z^m_{T_1}$ is on a line between $X_{T_1}$ and $Y_{T_1}$, and
hence all coordinates are between those of $X_{T_1}$ and $Y_{T_1}$. For part (1), we observe that
$\delta^m_{T_1}[i,j]$ changes sign at most once as $m$ changes, for any fixed pair $i,j$. The inequality
follows since the number of pairs $i,j$ is less than $n^2$. ■
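
A small sketch of the interpolating sequence may help make the two parts of Lemma 15 concrete. It works with the first entries only, since $\delta^m[i,j]$ depends only on them, places the $Z^m$ at equally spaced points on the segment, and counts unordered pairs in `sign_changes`. The equal spacing, the pair-counting convention and the function names are assumptions of this example.

```python
import random

def interpolate(X, Y, l):
    """Return Z^1, ..., Z^l, equally spaced on the segment from X to Y (first entries only)."""
    return [[(1 - s) * x + s * y for x, y in zip(X, Y)]
            for s in (m / (l - 1) for m in range(l))]

def sign_changes(Z, W):
    """Number of pairs i < j for which delta[i,j] = 2 - Z[i] - Z[j] has opposite signs at Z and W."""
    n = len(Z)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (2 - Z[i] - Z[j]) * (2 - W[i] - W[j]) < 0)

def count_bad_steps(Zs):
    """Number of adjacent pairs (Z^m, Z^{m+1}) with at least one sign change."""
    return sum(1 for m in range(len(Zs) - 1) if sign_changes(Zs[m], Zs[m + 1]) >= 1)
```

Because each $\delta^m[i,j]$ is affine in the interpolation parameter, it can change sign at most once along the segment, so `count_bad_steps` is at most the number of pairs however large `l` is.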
We are finally ready to show that adjacent pairs in the interpolating sequence
get closer when the entire sequence is run under the proportional coupling. Let
$A[m,t]$ be the event that $D[Z^m_s, Z^{m+1}_s] = 0$ for $s \le t$. Then by Lemmas 7 and 8, we have

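\[
E\big[\|Z^m_t - Z^{m+1}_t\|_2^2 \, 1_{A[m,t]}\big] \le \Big(1 - \frac{2}{3n}\Big)^{t} \|Z^m_0 - Z^{m+1}_0\|_2^2
\]

\[
\|Z^m_t - Z^{m+1}_t\|_1 \le \|Z^m_0 - Z^{m+1}_0\|_1
\]

which, combined with the Cauchy-Schwarz bound $\|v\|_1 \le \sqrt{2n}\,\|v\|_2$, tells us that
for $n \ge 2$,

\[
E\big[\|Z^m_{2t} - Z^{m+1}_{2t}\|_1\big] \le n\Big(1 - \frac{2}{3n}\Big)^{t} \|Z^m_0 - Z^{m+1}_0\|_1 + P\big[A[m,2t]^c\big]\,\|Z^m_0 - Z^{m+1}_0\|_1 . \tag{12}
\]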

So, it remains to bound $P[A[m,2t]^c]$. To do so, let $G[t,m,k]$ be the event that $Z^m_t[i,j]$
and $Z^{m+1}_t[i,j]$ are all at least $n^{-k}$ away from 0 and 2, and that $D[Z^m_t, Z^{m+1}_t] = 0$.
Then we find that

Lemma 16 (Chamber Occupancy). $P\big[D[Z^m_{t+1}, Z^{m+1}_{t+1}] \ge 1 \,\big|\, G[t,m,k]\big] \le n^{2+k}\,\|Z^m_0 - Z^{m+1}_0\|_1$.

Proof. To prove this, just note that at the next move, $P[\delta^m_{t+1}[i,j]\,\delta^{m+1}_{t+1}[i,j] < 0]$ is
bounded above by the ratio of the distance between $Z^m_t$ and $Z^{m+1}_t$ to the total
range that can be travelled. But the former is bounded above by $\|Z^m_0 - Z^{m+1}_0\|_1$
and the latter is bounded below by $n^{-k}$. This, combined with a union bound over all
$\frac{n(n-1)}{2}$ pairs of distinct $i,j$ gives the bound. ■

Thus, using a union bound for Lemma 16 over time from to t < n'^, as well as 
Lemma 14, we find that for any k > 2, P[A[m,tY] < 2\\Z^ - + n^-'' 

after a burn in period of at least Ti = {7k + 6.5)?t, log(n). Putting this together with 



inequality (12), we find that after the burn-in period. 



2 

rpWl'ym rym+l ll ] ^ ^ \t\\ rym rym+lw 

^[\\^Ti+2t ~ ^Ti+2t\\i\ ^ "-U ~ 7^) II^Ti ~ ^Ti I|2 

+ 2p^ -Z™+i||i(n^+^||Z™ -Z^+^lli + n^-^) 
By the triangle inequality, $\|X_{T_1+2t} - Y_{T_1+2t}\|_1 \le \sum_{m=1}^{l-1} \|Z^m_{T_1+2t} - Z^{m+1}_{T_1+2t}\|_1$, so

2 



E[\\XT,^2t - >T.+2*||i] < nil - -y 5^ p™ - 

m=l 

l-l 

+ 2^||ZJJ -Z^+^||i(n^+'=||Z^ -Z™+^||i+n 



m=l 
We note that this inequality holds for any choice of $k$, $l$ possibly depending on $t$.
In particular, if T2 = |(A + l)nlog(n), then choosing k = A + 3 and / = n^'^'^ gives 

(13) E[\\Xt,+2T, - Yt,+2T,\\i] < Sn-^ 

Finally, combining this with Markov's inequality proves Lemma 6. 

8. Coupling for Narrow Matrices 

In this section, we show that subset couplings are likely to succeed, and finish the 
proof of Theorem 2. The main lemma is: 

Lemma 17 (Coupling for Nearby Points). Fix $a > b + 3$ and $b, c > 0$. Let $X_t$, $Y_t$ be
two copies of the chain constructed as above, such that after a burn-in period of length
$T_1 = 7(b + 7.5)\,n\log(n)$ during which $X_t$, $Y_t$ evolve by proportional coupling we have
$\|X_{T_1} - Y_{T_1}\|_1 < n^{-a}$. Then for $t > T_1 + (\frac{1}{2} + c)\,n\log(n)$ we have
$\|\mathcal{L}(X_t) - \mathcal{L}(Y_t)\|_{TV} \le 4n^{3+b-a} + 2n^{4-b} + n^{-c}$.

Construct a partition process from time $T_1$ to time $T = T_1 + (\frac{1}{2} + c)\,n\log(n)$. Our
first step is to define subset couplings and show that if $X_t$ and $Y_t$ are very close to each
other and not too close to certain hyperplanes, then any subset couplings are likely
to succeed. To define a subset coupling of $X_t$ and $Y_t$, fix the subset $S$ of interest and
common update variables $i = i(t) \in S$ and $j = j(t) \notin S$. If $\delta_t[i,j]\,\epsilon_t[i,j] > 0$, then the
coupling of $\lambda^x_t$ and $\lambda^y_t$ is exactly as described for the other walk immediately before
Lemma 8. Otherwise, assume $\delta_t[i,j] < 0$ and $\epsilon_t[i,j] > 0$, and choose $\lambda^x_t$ from $[0,1]$
uniformly at random. Then set $\lambda^y_t$ to be the number which satisfies $w(X_{t+1}, S) =
w(Y_{t+1}, S)$ if such a number exists and is in the interval $[0,1]$. Just as with equation
(5), the measure (with mass less than 1) on $\lambda^y_t$ that this assignment defines minorizes
the uniform distribution, and so leaves a remainder distribution analogous to that
given in equation (6). If there is no value of $\lambda^y_t$ in $[0,1]$ which would allow $w(X_{t+1}, S) =
w(Y_{t+1}, S)$, then choose $\lambda^y_t$ from this remainder distribution. If $\delta_t[i,j] > 0$
and $\epsilon_t[i,j] < 0$, the same construction works, but with $\lambda^y_t$ chosen first, and $\lambda^x_t$ chosen
to satisfy $w(X_{t+1}, S) = w(Y_{t+1}, S)$.

Let p{X, Y, i, j, S) be the probability that a subset coupling of X, Y associated with 
subset S works given that coordinates i,j are updated. The proof of the following 
lemma is nearly identical to the proof of Lemma 4 in [26]:

Lemma 18 (Subset Coupling). Fix $a - 2 > b > 0$. For a pair of matrices $(x,y)$
satisfying $\sup_{m,k} |x[m,k] - y[m,k]| < n^{-a}$ and $\min_{m,k}(x[m,k], y[m,k], 2 - x[m,k], 2 -
y[m,k]) > n^{-b}$, we have for all sufficiently large $n$ that $p(x,y,i,j,S) \ge 1 - 4n^{2+b-a}$
uniformly in $S \subset [n]$ and pairs $i \in S$, $j \notin S$.

Next, as in Lemma 9, note that after a successful subset coupling involving sets
$S$ and $R$ at time $t$, we have $\|X_{t+1} - Y_{t+1}\|_{1,S \cup R} \le \|X_t - Y_t\|_{1,S \cup R}$. Thus, if all subset
couplings until time $t$ have succeeded,

\[
\|X_t - Y_t\|_{1,S} \le \|X_{T_1} - Y_{T_1}\|_{1,S} \tag{14}
\]

for all $S \in P_t$.



We are ready to prove Lemma 17. By inequality (14) and Lemma 18, the probability
of a subset coupling failing at some time $t$ given all previous subset couplings have
succeeded is at most $4n^{b+2-a} + 2n^{3-b}$. Using a union bound over all at most $n - 1$
subset couplings, and then applying Lemma 4, the probability of all subset couplings
succeeding and the partition process satisfying $P_T = \{[n]\}$ is at least $1 - 4n^{3+b-a} -
2n^{4-b} - n^{-c}$. By Lemma 5, this is a lower bound on the probability that $X_T = Y_T$,
and thus by Lemma 3 it yields an upper bound on the distance of $X_T$ to stationarity. This
proves the lemma.

Next, we put together Lemmas 11 and 17. Using the constants in those lemmas,
we set $b = c + 4$, $a = 2c + 7$ and $A = 3c + 7$, to find that

\[
\|\mathcal{L}(X_T) - \mathcal{L}(Y_T)\|_{TV} \le 13n^{-c}
\]

which is the upper bound in Theorem 2. To prove the lower bound, let $\tau$ be the
(random) first time at which all $2n$ coordinates have been updated. Then fix the
starting position $X_0$ of the Markov chain and let $H_{ij} = \{X \in \mathbb{R}^{2n} \mid X[i,j] = X_0[i,j]\}$
and set $H = \cup_{i,j} H_{ij}$. Then $P[X_t \in H] - U_n(H) \ge P[\tau > t]$. Since only four of $2n$
coordinates are chosen at a time, the classical coupon-collector results in [12] tell us
that at time $T = \frac{1}{2} n(\log(n) - c)$, $|K^T(x, H) - \pi(H)| \ge 1 - \exp(-\exp(c)) + o(1)$ as $n$
goes to infinity.
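
The lower bound rests on the time needed for every column to be chosen at least once when each move touches two of the $n$ columns (four of the $2n$ coordinates). The sketch below simulates that time. The function names are hypothetical, and the comparison value $\frac{1}{2} n \log(n)$ is the usual coupon-collector heuristic for two coupons per draw, used here only as a rough reference point.

```python
import math
import random

def first_full_update_time(n, rng):
    """Steps until all n columns have been chosen, picking 2 distinct columns per step."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.update(rng.sample(range(n), 2))
        t += 1
    return t

def mean_full_update_time(n, runs, seed=0):
    """Monte Carlo average of first_full_update_time over independent runs."""
    rng = random.Random(seed)
    return sum(first_full_update_time(n, rng) for _ in range(runs)) / runs

def reference_time(n):
    """Heuristic reference value (n / 2) * log(n)."""
    return 0.5 * n * math.log(n)
```

Comparing `mean_full_update_time(n, runs)` with `reference_time(n)` for a few values of `n` gives a feel for the order $n \log(n)$ of the time before which the chain cannot have mixed.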